Data Summary and Visualization

STA 101 - Summer I 2022

Raphael Morsomme

Welcome

Announcements

  • If you’re just joining the class, welcome! Go to the course website and review content you’ve missed, read the syllabus, and complete the Getting to know you survey.
  • Drop/Add for Term 1 ends tomorrow.

Recap of last lecture

  • observations (rows) and variables (columns)
  • population parameters and sample statistics
  • statistical inference
  • sampling
  • four types of variables
  • experiments, observational studies and causal claims
Types of variables are broken down into numerical (which can be discrete or continuous) and categorical (which can be ordinal or nominal).

Breakdown of variables into their respective types.

Source: IMS

Outline

  • Visualization for numerical data
  • Summary for numerical data
  • Visualization for categorical data
  • Summary for categorical data
  • More visualizations

“The greatest value of a picture is when it forces us to notice what we never expected to see.” — John Tukey

7 Billion: Are You Typical?

National Geographic

Group exercise - data summaries?

  1. What variables are mentioned in the video?
  2. What are their types?
  3. How were they summarized and/or visualized?
03:00

Visualization for numerical data

US birth data

library(tidyverse) # loads ggplot2, dplyr and the %>% pipe used below
d_birth <- fivethirtyeight::US_births_1994_2003

There are 3,652 observations (rows)

nrow(d_birth) # number of rows
[1] 3652

and 6 variables (columns)

ncol(d_birth) # number of columns
[1] 6

head(d_birth)
# A tibble: 6 x 6
   year month date_of_month date       day_of_week births
  <int> <int>         <int> <date>     <ord>        <int>
1  1994     1             1 1994-01-01 Sat           8096
2  1994     1             2 1994-01-02 Sun           7772
3  1994     1             3 1994-01-03 Mon          10142
4  1994     1             4 1994-01-04 Tues         11248
5  1994     1             5 1994-01-05 Wed          11053
6  1994     1             6 1994-01-06 Thurs        11406

Histogram

ggplot(d_birth) +
  geom_histogram(aes(births))

  • Higher bars indicate where the data are relatively more common
  • More days with around 8,000 births or with around 12,500 births
  • Few days with fewer than 7,000 or more than 14,000 births
  • Also few days with around 10,000 births

We can change the number of bins to have a rougher or more detailed histogram.

ggplot(d_birth) +
  geom_histogram(aes(births), bins = 10)

ggplot(d_birth) +
  geom_histogram(aes(births), bins = 100)
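
To see what the number of bins controls, here is a minimal sketch that counts observations per bin by hand. The vector `x_toy` and the break points are made up for illustration (they are not the birth data), and `cut()` with `table()` only approximates what a histogram does internally: each value is assigned to an interval, and the bar height is the count in that interval.

```r
# Toy data (hypothetical values, not taken from the birth data)
x_toy <- c(1, 2, 2, 3, 3, 3, 7, 8, 8, 9)

# Split the range [0, 10] into 2 wide bins, then into 5 narrower bins
breaks_coarse <- seq(0, 10, length.out = 3)  # 0, 5, 10
breaks_fine   <- seq(0, 10, length.out = 6)  # 0, 2, 4, 6, 8, 10

# Count how many values fall into each bin (intervals are closed on the right)
counts_coarse <- table(cut(x_toy, breaks = breaks_coarse))
counts_fine   <- table(cut(x_toy, breaks = breaks_fine))

counts_coarse  # 2 bins: one bar of height 6, one of height 4
counts_fine    # 5 bins: heights 3, 3, 0, 3, 1
```

With 2 bins the data look like one large group and one smaller group; with 5 bins the empty interval in the middle becomes visible. Too few bins can hide structure, and too many can make the picture noisy.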

Statistics as an art - describing a distribution

It is always a good idea to make a histogram of continuous variables. To describe a distribution, we comment on

  • mode(s): unimodal, bimodal, multimodal
  • shape of each mode: flat, bell-shape, bounded
  • symmetry: symmetric, left skewed, right skewed
  • outliers: presence of extreme values

Note

Some distributions will not fit neatly into these categories.

Describing the US birth data

The distribution of the daily number of births in the US is bimodal, with each mode being bell-shaped and symmetric. We observe no extreme values.

Group exercise - describing a distribution

Exercise 5.10

03:00

Scatterplots

d_car <- ggplot2::mpg
head(d_car)
# A tibble: 6 x 11
  manufacturer model displ  year   cyl trans      drv     cty   hwy fl    class 
  <chr>        <chr> <dbl> <int> <int> <chr>      <chr> <int> <int> <chr> <chr> 
1 audi         a4      1.8  1999     4 auto(l5)   f        18    29 p     compa~
2 audi         a4      1.8  1999     4 manual(m5) f        21    29 p     compa~
3 audi         a4      2    2008     4 manual(m6) f        20    31 p     compa~
4 audi         a4      2    2008     4 auto(av)   f        21    30 p     compa~
5 audi         a4      2.8  1999     6 auto(l5)   f        16    26 p     compa~
6 audi         a4      2.8  1999     6 manual(m5) f        18    26 p     compa~

We will look at the relation between engine size (displ) and fuel efficiency (hwy).

ggplot(d_car) +
  geom_point(aes(displ, hwy))

Scatterplots are used to visualize the relation between two numerical variables.

Note

To add an additional variable to your visualization, you can use color or symbols.

ggplot(d_car) +
  geom_point(aes(displ, hwy, col = drv))

ggplot(d_car) +
  geom_point(aes(displ, hwy, shape = drv))

Summary for numerical data

Measures of centrality

  • The average: \(\bar{x} = \dfrac{x_1 + \dots + x_n}{n}\)

  • The median: the middle value

mean(d_birth$births)   # average
[1] 10876.82
median(d_birth$births) # median
[1] 11615
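
The two definitions can be checked by hand on a small vector. This is only a sketch: `x_toy` is made-up data, and the by-hand median below assumes an odd number of observations (with an even number, `median()` averages the two middle values).

```r
x_toy <- c(3, 1, 4, 10, 2)  # hypothetical values

# Average: sum of the values divided by the number of values
avg_by_hand <- sum(x_toy) / length(x_toy)            # 20 / 5 = 4

# Median: the middle value once the data are sorted
x_sorted       <- sort(x_toy)                        # 1 2 3 4 10
median_by_hand <- x_sorted[(length(x_toy) + 1) / 2]  # 3rd of 5 values = 3

avg_by_hand == mean(x_toy)       # TRUE
median_by_hand == median(x_toy)  # TRUE
```

The value 10 pulls the average (4) above the median (3), which hints at why the two measures can disagree when the data are not symmetric.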

Percentiles

Percentiles are a generalization of the median.

Since the median is larger than 50% of the data and smaller than the rest, it is called the 50th percentile.

Similarly, the value that is larger than p% of the data and smaller than the rest is called the p-th percentile.

We will soon make use of the 25th and 75th percentiles.

Later in the course, the 95th and 97.5th percentiles will also be useful.
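
In R, percentiles can be computed with `quantile()`. The sketch below uses a made-up vector `x_toy`. Note that `quantile()` interpolates between observations by default and that several slightly different definitions exist (see its `type` argument), so results on small samples may differ a little from a by-hand calculation.

```r
x_toy <- 1:9  # hypothetical values

# 25th, 50th and 75th percentiles
q <- quantile(x_toy, probs = c(0.25, 0.50, 0.75))
q

# The 50th percentile coincides with the median
unname(q["50%"]) == median(x_toy)  # TRUE
```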

Measures of variation

  • Variance: average squared distance from the mean
    • Standard deviation (sd): square root of the variance (roughly speaking, the average distance to the mean)
    • For bell-shaped distributions, most (about 95%) of the data are within 2 sd of the mean.
  • Inter-quartile range (IQR): distance between the 25th and the 75th percentiles.
var(d_birth$births) # variance
[1] 3454270
sd(d_birth$births) # sd
[1] 1858.567
IQR(d_birth$births) # iqr
[1] 3429.75
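
The sketch below recomputes these three measures by hand on a made-up vector `x_toy`, to make the definitions concrete. One detail worth knowing: R's `var()` divides the sum of squared distances by n - 1 rather than n, so "average squared distance" is only approximately right. The last lines check the 2 sd rule on simulated bell-shaped data (`z` is simulated, not real data).

```r
x_toy <- c(2, 4, 4, 4, 5, 5, 7, 9)  # hypothetical values; their mean is 5

# Variance: sum of squared distances from the mean, divided by n - 1
n           <- length(x_toy)
var_by_hand <- sum((x_toy - mean(x_toy))^2) / (n - 1)  # 32 / 7

# Standard deviation: square root of the variance
sd_by_hand <- sqrt(var_by_hand)

# IQR: 75th percentile minus 25th percentile
iqr_by_hand <- unname(quantile(x_toy, 0.75) - quantile(x_toy, 0.25))

all.equal(var_by_hand, var(x_toy))  # TRUE
all.equal(sd_by_hand,  sd(x_toy))   # TRUE
all.equal(iqr_by_hand, IQR(x_toy))  # TRUE

# 2 sd rule on simulated bell-shaped data
set.seed(1)
z <- rnorm(10000)
prop_within_2sd <- mean(abs(z - mean(z)) <= 2 * sd(z))
prop_within_2sd  # close to 0.95
```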

Robustness

Real-world data often contain extreme values caused by, for example,

  • measurement error
  • typos
  • …

The average, median, variance, sd and iqr are not equally robust to the presence of extreme values.

Let us contaminate the birth data with a value of 1 billion…

x_uncontaminated <- d_birth$births  
x_contaminated   <- c(x_uncontaminated, 1e9)

…and compare the mean, median, variance, sd and iqr of these two variables

summary(x_uncontaminated)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   6443    8844   11615   10877   12274   14540 
summary(x_contaminated)
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
6.443e+03 8.845e+03 1.162e+04 2.846e+05 1.228e+04 1.000e+09 
var(x_uncontaminated); var(x_contaminated)
[1] 3454270
[1] 2.737417e+14
sd(x_uncontaminated); sd(x_contaminated)
[1] 1858.567
[1] 16545140
IQR(x_uncontaminated); IQR(x_contaminated)
[1] 3429.75
[1] 3430

Robustness of the median and the iqr

While the median and iqr are robust to the presence of extreme values, the mean, the variance and the sd are not.

Group exercise - summary statistics

Exercises 5.8, 5.11, 5.15 (replace part \(c\) by height of adults)

Note: Q1, the first quartile, is the 25th percentile (larger than one quarter of the data); Q3, the third quartile, is the 75th percentile.

06:00

Summary for categorical data

Frequency table (1d)

head(d_car)
# A tibble: 6 x 11
  manufacturer model displ  year   cyl trans      drv     cty   hwy fl    class 
  <chr>        <chr> <dbl> <int> <int> <chr>      <chr> <int> <int> <chr> <chr> 
1 audi         a4      1.8  1999     4 auto(l5)   f        18    29 p     compa~
2 audi         a4      1.8  1999     4 manual(m5) f        21    29 p     compa~
3 audi         a4      2    2008     4 manual(m6) f        20    31 p     compa~
4 audi         a4      2    2008     4 auto(av)   f        21    30 p     compa~
5 audi         a4      2.8  1999     6 auto(l5)   f        16    26 p     compa~
6 audi         a4      2.8  1999     6 manual(m5) f        18    26 p     compa~
table(d_car$drv)

  4   f   r 
103 106  25 

Contingency table (2d)

table(d_car$class, d_car$drv)
            
              4  f  r
  2seater     0  0  5
  compact    12 35  0
  midsize     3 38  0
  minivan     0 11  0
  pickup     33  0  0
  subcompact  4 22  9
  suv        51  0 11

Proportion table (2d)

table(d_car$class, d_car$drv) %>%
  prop.table() %>%
  round(2)
            
                4    f    r
  2seater    0.00 0.00 0.02
  compact    0.05 0.15 0.00
  midsize    0.01 0.16 0.00
  minivan    0.00 0.05 0.00
  pickup     0.14 0.00 0.00
  subcompact 0.02 0.09 0.04
  suv        0.22 0.00 0.05
table(d_car$class, d_car$drv) %>%
  prop.table(1) %>%
  round(2)
            
                4    f    r
  2seater    0.00 0.00 1.00
  compact    0.26 0.74 0.00
  midsize    0.07 0.93 0.00
  minivan    0.00 1.00 0.00
  pickup     1.00 0.00 0.00
  subcompact 0.11 0.63 0.26
  suv        0.82 0.00 0.18
table(d_car$class, d_car$drv) %>%
  prop.table(2) %>%
  round(2)
            
                4    f    r
  2seater    0.00 0.00 0.20
  compact    0.12 0.33 0.00
  midsize    0.03 0.36 0.00
  minivan    0.00 0.10 0.00
  pickup     0.32 0.00 0.00
  subcompact 0.04 0.21 0.36
  suv        0.50 0.00 0.44
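
The three proportion tables differ only in what each count is divided by. As a sketch, the code below rebuilds the contingency table from the counts printed above (so it runs without `d_car`) and performs the three divisions by hand; the object names `tab`, `overall`, `by_row` and `by_col` are illustrative.

```r
# Counts copied from the contingency table above (rows: class, columns: drv)
tab <- matrix(
  c( 0,  0,  5,
    12, 35,  0,
     3, 38,  0,
     0, 11,  0,
    33,  0,  0,
     4, 22,  9,
    51,  0, 11),
  ncol = 3, byrow = TRUE,
  dimnames = list(
    class = c("2seater", "compact", "midsize", "minivan",
              "pickup", "subcompact", "suv"),
    drv   = c("4", "f", "r")
  )
)

# prop.table(): each cell divided by the grand total (all cells sum to 1)
overall <- tab / sum(tab)
# prop.table(1): each cell divided by its row total (each row sums to 1)
by_row  <- tab / rowSums(tab)
# prop.table(2): each cell divided by its column total (each column sums to 1)
by_col  <- sweep(tab, 2, colSums(tab), "/")

round(overall["suv", "4"], 2)  # 51 / 234 = 0.22
round(by_row["suv", "4"], 2)   # 51 / 62  = 0.82
round(by_col["suv", "4"], 2)   # 51 / 103 = 0.5
```

Read this way, 22% of all cars are four-wheel-drive SUVs, 82% of SUVs are four-wheel drive, and about half of the four-wheel-drive cars are SUVs.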

Group exercise - contingency and proportion table

  1. What does the number \(12\) represent in the contingency table?
  2. What does the number \(0.05\) (2nd row, 1st column) represent in the first proportion table?
  3. What does the number \(0.26\) (2nd row, 1st column) represent in the row proportion table?
04:00

Visualization for categorical data

Barplot

ggplot(d_car) +
  geom_bar(aes(drv))

Advanced barplots

ggplot(d_car) +
  geom_bar(aes(drv, fill = class))

ggplot(d_car) +
  geom_bar(aes(drv, fill = class), position = "dodge")

ggplot(d_car) +
  geom_bar(aes(drv, fill = class), position = "fill")

Group exercise - pros and cons of barplots

Exercise 4.5

03:00

Advanced visualizations

Faceted histograms

d_birth_small <- filter(d_birth, year %in% c(1994, 1997, 2000, 2003))
ggplot(d_birth_small) +
  geom_histogram(aes(births)) + 
  facet_grid(year~.)

Mosaic plot

library(ggmosaic) # provides geom_mosaic()
ggplot(d_car) +
  geom_mosaic(aes(x = product(drv), fill = class))

✅ Combines the strengths of the various barplots

🛑 Not in the tool box of every data scientist

Boxplots

from R4DS, Wickham and Grolemund

  • The thick line in the middle of the box indicates the median

  • The box stretches from the 25th percentile to the 75th percentile; it covers 50% of the data.

  • The length of each whisker is at most 1.5 IQR

  • Any observation more than 1.5 iqr away from the box is labelled as an outlier.
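
The rules above can be applied by hand. The sketch below uses a made-up vector `x_toy` with one extreme value and percentiles from `quantile()`; plotting functions may compute the box edges slightly differently, so treat this as an illustration of the rule rather than a reproduction of any particular plot.

```r
x_toy <- c(1:9, 30)  # hypothetical values with one extreme observation

q1  <- unname(quantile(x_toy, 0.25))  # 25th percentile (lower edge of the box)
q3  <- unname(quantile(x_toy, 0.75))  # 75th percentile (upper edge of the box)
iqr <- q3 - q1

# Fences: 1.5 IQR below the box and 1.5 IQR above it
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# Observations beyond the fences are labelled as outliers
outliers <- x_toy[x_toy < lower_fence | x_toy > upper_fence]
outliers  # 30

# Whiskers reach the most extreme observations that are not outliers
whiskers <- range(x_toy[x_toy >= lower_fence & x_toy <= upper_fence])
whiskers  # 1 9
```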

Outliers

Outliers have an extreme value. How to deal with an outlier depends on why the observation stands out.

Group exercise - types of associations

Exercise 5.13

02:00

ggplot(d_birth) +
  geom_boxplot(aes(y = births))

ggplot(d_birth) +
  geom_boxplot(aes(y=births, x=day_of_week))

Editing figures

Figure title

ggplot(d_car) +
  geom_point(aes(displ, hwy)) +
  labs(title = "Highway fuel efficiency by engine size")

Axis title

ggplot(d_car) +
  geom_point(aes(displ, hwy)) +
  labs(
    title = "Highway fuel efficiency by engine size",
    x = "Engine size (displacement in litres)",
    y = "Fuel efficiency on the highway (mpg)"
    )

ggplot(d_car) +
  geom_point(aes(displ, hwy)) +
  labs(
    title = "Highway fuel efficiency by engine size",
    x = "Engine size (displacement in litres)",
    y = "Fuel efficiency on the highway (mpg)"
    ) +
  theme_bw()

ggplot(d_car) +
  geom_point(aes(displ, hwy)) +
  labs(
    title = "Highway fuel efficiency by engine size",
    x = "Engine size (displacement in litres)",
    y = "Fuel efficiency on the highway (mpg)"
    ) +
  theme_classic()

ggplot(d_car) +
  geom_point(aes(displ, hwy)) +
  labs(
    title = "Highway fuel efficiency by engine size",
    x = "Engine size (displacement in litres)",
    y = "Fuel efficiency on the highway (mpg)"
    ) +
  theme_dark()

Editing tables

library(kableExtra) # provides kbl() and kable_classic()
table(d_car$class, d_car$drv) %>%
  prop.table(1) %>%
  round(2) %>%
  kbl(caption = "Distribution of drive type per class of car") %>%
  kable_classic(full_width = FALSE, c("striped", "hover"))

Distribution of drive type per class of car

                4    f    r
  2seater    0.00 0.00 1.00
  compact    0.26 0.74 0.00
  midsize    0.07 0.93 0.00
  minivan    0.00 1.00 0.00
  pickup     1.00 0.00 0.00
  subcompact 0.11 0.63 0.26
  suv        0.82 0.00 0.18

📋 See this vignette for more details on editing tables

Effective communication

Statistics as an art - figures

  • Have a purpose: is the figure necessary?

  • Parsimony: keep it simple and avoid distractions

  • Tell a story: provide context and interpret the figure

  • Show \(\ge 3\) variables whenever possible: color, facets, etc.

  • Edit your figure: title, axes, etc

📋 See R for Data Science - chapters 3 and 7 for more on data visualization in R.

Recap

  • Histogram, scatterplot, boxplot
  • Average, median, variance, sd and IQR; robustness
  • Frequency, contingency and proportion tables
  • Barplot, mosaic plot
  • Effective communication: well-edited figures, \(\ge3\) variables (symbols, colors, facets)
  • R for Data Science - chapters 3 and 7

“The simple graph has brought more information to the data analyst’s mind than any other device.” — John Tukey